Abstract

Background: Genome-wide mapping of protein-DNA interactions has been widely used to investigate biological 
functions of the genome. An important question is to what extent such interactions are regulated at the DNA 
sequence level. However, current investigation is hampered by the lack of computational methods for systematically
evaluating sequence specificity.

Results: We present a simple, unbiased quantitative measure for DNA sequence specificity called the Motif 
Independent Measure (MIM). By analyzing both simulated and real experimental data, we found that the MIM 
measure can be used to detect sequence specificity independent of presence of transcription factor (TF) binding 
motifs. We also found that the level of specificity associated with H3K4me1 target sequences is highly cell-type 
specific and highest in embryonic stem (ES) cells. We predicted H3K4me1 target sequences by using the N-score
model and found that the prediction accuracy is indeed high in ES cells. The software to compute the MIM is freely
available at: https://github.com/lucapinello/mim.

Conclusions: Our method provides a unified framework for quantifying DNA sequence specificity and serves as a 
guide for development of sequence-based prediction models. 



Background 

Of the entire 3 Gb human genome, only about 2% codes
for proteins. The identification of biological functions of
the entire genome remains a major challenge [1,2]. One
powerful avenue to gain functional insights is to identify
the proteins that bind to each genomic region. Recent
development of chromatin immunoprecipitation followed
by microarray or sequencing (ChIP-chip or ChIPseq)
technologies has made it feasible to map genome-wide
protein-DNA interaction profiles [3-5]. The data generated
by these experiments have not only greatly facilitated the 
genome-wide characterization of regulatory elements such 
as enhancers [6,7] but also been integrated with other data 
sources to build gene regulatory networks [8-11]. 

An important question is to what extent a specific
protein-DNA interaction is mediated at the level of genomic
sequences. While it is well known that specific sequence
motifs are crucial for transcription factor (TF)-mediated
cis-regulation, there are many other proteins, such as
chromatin modifiers, whose target sequences cannot simply
be characterized by a handful of distinct motifs [12].
Such sequences are often regarded as nonspecific and
not studied further. However, recent studies in nucleosome
positioning have provided new insights by going
beyond this motif-centric view [13]. Here various
sequence features have been associated with nucleosome
positioning, including poly dA:dT tracts [14,15],
abundance of G/C content [16,17], and certain periodic
patterns [18,19]. Such patterns cannot be captured by
traditional motif analysis methods. Similar results have
been obtained by analyzing histone modification [20,21]
and DNA methylation data [22,23].

Despite the success of these recent sequence-based
prediction models, it remains difficult to determine
which sequences lack intrinsic specificity, because a poor
prediction outcome might merely imply that more sophisticated
models are needed. A guide is needed for developing sequence-based
prediction models. To this end, here we present a simple
approach to quantify sequence specificity based on the
frequency distribution of k-mers. We will also systematically
investigate the relative merit of various distance or
similarity functions for capturing specific sequence
information. While k-mers have been used extensively to detect
splice sites [24], to study functional genomic regions [25],
to identify protein coding genes [26], and in motif
analysis (reviewed by [27]), to our knowledge, they have
not been used to quantify sequence specificity.

We evaluated the performance of our approach by analyzing
one simulated dataset and two real experimental
datasets, corresponding to a TF (STAT1) and a histone
modification (H3K4me1) respectively. Our results have
provided new insights into the role of DNA sequences in 
modulating protein-DNA interactions regardless of motif 
presence. 

Results 

A simple measure of sequence specificity 

While specific sequence information has been identified
in the absence of distinct motifs, to our knowledge, it is
always associated with enrichment of certain k-mers
(where k is a small number, such as 4). The main difference
from motifs is that, when k is small, a single k-mer
may occur many times in the genome and therefore
would not be useful for any practical purpose. On the
other hand, we reasoned that more specific information
can be obtained by combinations of multiple k-mers.
Therefore, it seems appropriate to quantify sequence
specificity by aggregating enrichment information for all
k-mers. For the rest of the paper, we fix k = 4, although the
method presented below is equally applicable to any
choice of k. Treating complementary sequences as identical,
there are 136 non-redundant 4-mers. By counting the
frequency of each 4-mer, each input sequence is then
mapped to a 136-dimensional numerical vector containing
the frequency of each k-mer. The distributions
corresponding to sequences containing specific information
should be distinct from those for random sequences,
which are generated to match the number and length of
the input sequences. We use the symmetric Kullback-Leibler
(KL) divergence [28] for comparing frequency
distributions and average over the entire set of input
sequences. We term the resulting value the Motif
Independent Metric (MIM). To evaluate statistical
significance, we estimate the null distribution by computing
MIM values for sets of random sequences. The detailed
procedure is described in the Methods section.
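The mapping from a sequence to a 136-dimensional 4-mer frequency vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and choosing the lexicographically smaller of a k-mer and its reverse complement as the representative is an assumption made here for concreteness.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("acgt", "tgca")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    The choice of representative is an assumption; any consistent rule works.
    """
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k=4):
    """List all non-redundant k-mers when complementary strands are merged."""
    return sorted({canonical("".join(p)) for p in product("acgt", repeat=k)})


def kmer_frequencies(seq, k=4):
    """Map a DNA sequence to a vector of canonical k-mer frequencies."""
    seq = seq.lower()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("acgt"):  # skip windows with N or other symbols
            counts[canonical(word)] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in canonical_kmers(k)]


print(len(canonical_kmers(4)))  # 136
```

The count of 136 follows from the 256 possible 4-mers: 16 are their own reverse complement, and the remaining 240 pair up into 120 classes.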

Model Validation 
Simulated data 

As an initial evaluation, we synthetically generated 8
sequence sets each containing 2000 sequences, mimicking
TF ChIPseq experiments for which the corresponding TF
recognizes a single motif: TTGACA. The difference
between these sequence sets is the motif strength, which is
parameterized by a real number s (see Methods). In
particular, a perfect motif corresponds to s = 0, whereas a
random sequence corresponds to s = 0.25. In a typical
ChIPseq experiment, only a subset of target sequences
contains the motif. To simulate this fact, we randomly
selected 1000 sequences from each set and inserted the
motif at a randomly selected location. As control, we also
synthesized 1000 sets of 2000 random sequences each.

We calculated the MIM values for each sequence set
and evaluated the statistical significance of the resulting
values. We found that the MIM values are statistically
significant (p-value < 0.001) for s up to 0.1 (Figure 1a and
1b). The information content for the corresponding motif
is 5.35 bits, which is still lower than 98% of the motifs in
the JASPAR core database [29]. In the following we will
show that our method indeed performs well for real data.
We ranked each k-mer according to its relative contribution
to the MIM. The most informative k-mers are shown
in Table 1. The methodology used to select such k-mers is
outlined in the Methods section. We noticed that the top
k-mers are substrings of the inserted motif (highlighted in
bold in Table 1), suggesting that these k-mers may be used
as a seed for motif detection, in a similar way as the
dictionary approach [30]. In addition to the KL divergence
considered here, there are a number of other metrics to
compare frequency distributions. We selected a few
commonly used metrics and repeated the above analysis
(Methods). We found that the results are quite similar
(Table 2).
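The information content quoted for s = 0.1 can be reproduced with a short calculation. The sketch below assumes that information content is measured as the relative entropy of each motif position against the non-uniform background used for the synthetic data (A, C, G, T = 0.15, 0.35, 0.35, 0.15); this interpretation and the function names are assumptions, not taken from the original software.

```python
import math

# Background base probabilities used for the synthetic sequences (see Methods).
BACKGROUND = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}


def motif_pwm(consensus, s):
    """Build a PWM where the consensus base has probability 1 - 3s and others s."""
    return [
        {base: (1 - 3 * s if base == c else s) for base in "acgt"}
        for c in consensus.lower()
    ]


def information_content(pwm, background=BACKGROUND):
    """Sum over positions of the relative entropy (in bits) to the background."""
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # treat 0 * log(0) as 0
                total += p * math.log2(p / background[base])
    return total


ic = information_content(motif_pwm("TTGACA", 0.1))
print(round(ic, 2))  # 5.35
```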
Real ChIPseq data

To validate our method using real experimental data, we
analyzed a publicly available ChIPseq dataset for STAT1
[31], a member of the signal transducer and activator of
transcription (STAT) family of TFs, in the HeLa S3 cell line.
The dataset contains 39,000 target sequences, 35% of
which contain the consensus motif TTCCNGGAA
(JASPAR database [29]). As control, we sampled random
sequences from genomic background matching the
number and length of the target sequences.

We evaluated the level of sequence specificity of the 
whole set of target sequences by using the MIM measure. 
The sequences are indeed highly specific (see Figure 2a
and 2b). Again, among the top ranked k-mers, several are
substrings of the "classic" STAT1 motif (highlighted in
bold in Table 3), suggesting it may provide useful
information for identifying discriminative sequence signatures
without the knowledge of TF motifs. Furthermore, the
results are not sensitive to the specific choice of distances
as in the simulated data experiment (Table 4).

Detecting sequence specificity in absence of a dominant motif

STAT1

As mentioned above, while the presence of the STAT1
motif can explain the sequence specificity for 35% of the
target sequences, it is unclear how the TF is recruited to the
other 65% of the targets. In order to evaluate the role of
DNA sequence specificity for these motif-absent targets,
we compared the MIM values between the motif-present
and motif-absent subsets of targets. Surprisingly,
we found that the MIM value for motif-absent targets is
almost indistinguishable from motif-present targets (see
Figure 2a and 2b). This high level of specificity cannot
be simply explained by promoter-related biases, because
only 11% of target sequences are located in promoters.

Figure 1 MIM values for simulated sequences. (a) The MIM values and corresponding p-values (above the bars) for the simulated data. Note
that the MIM values change in the same direction as motif strength; (b) comparison of the MIM values with respect to the null distribution,
which is obtained by using 1000 sets of random sequences.

To gain mechanistic insights, we searched for enrichment
of other TF motifs in the JASPAR database [29],
using the FIMO software [32]. We found two motifs that
are significantly enriched (threshold p-value < 10^-6): SP1
and ESR1, both of which have previously been shown to interact
with STAT1 [33,34]. Therefore, STAT1 might be recruited
to the motif-absent targets through interaction with these
other TFs. We further compared the associated gene
ontology terms between the motif-present and motif-absent
sets to see if there are any functional differences.
We found that these two sets share many similar biological
functions, such as hydrolase and ATPase activities (p <
10^-17). On the other hand, while the motif-present targets
are highly enriched for the voltage-gated calcium channel
complex (p < 10^-12), the motif-absent targets are highly
enriched for cytoplasmic components instead (p < 10^-12).
H3K4me1 

Unlike TFs, histone (de)modifying enzymes usually do not
directly interact with DNA. The role of DNA sequences in
the regulation of histone modification patterns remains
poorly understood. As an example, the histone modification
H3K4me1 plays an important role in gene regulation
by demarcating cell-type specific enhancers [6]; yet how it
is recruited to enhancer regions is poorly understood. We
hypothesized that DNA sequence may play a
cell-type specific role and aimed to detect such differences
by using our MIM measure. To this end, we assembled an
H3K4me1 ChIPseq dataset in seven human cell-lines,
including H1 (a human embryonic stem cell line), K562 (a
myelogenous leukemia cell line), Huvec (human umbilical
vein endothelial cells), Nhek (normal human epidermal
keratinocytes), and three T cell-lines (CD4+, CD36+, and
CD133+) from the public domain [1,4,35]. For each cell
line, we identified the peak locations by using cisGenome
[36] and then calculated the MIM value for DNA sequences at
the peaks (Table 5 lists the top 20 k-mers ranked by different
distances for the H1 cell line). The MIM values are highly
cell-type specific (see Figure 3a and 3b and Table 6).
Table 1 Top 20 k-mers ranked by different distances on Cell 1 of the synthetic dataset

Cell 1

KL      Bhattacharyya   Hellinger

tcaa    tcaa            tcaa
gaca    gaca            gaca
gtca    gtca            gtca
acag    acag            acag
caag    caag            caag
attg    caac            attg
acat    acac            acat
caac    attg            caac
acac    acat            acac
aatg    aatg            aatg
acaa    acaa            acaa
ttaa    cgga            ttaa
aaat    caaa            aaat
cgga    gacc            cgga
caaa    agat            caaa
aatt    aaat            aatt
cata    taaa            cata
gacc    ccgc            gacc
agat    cgcc            agat
gtaa    aagg            gtaa
agat    cgcc            agat



Interestingly, the value for H1 cells is much higher than
any other cell line, suggesting that the DNA sequence
plays a unique role in H3K4me1 recruitment in ES cells.
To eliminate the possibility that this difference may be
simply due to a GC content related bias, we repeated the
analysis by using a different null model, obtained by
randomly shuffling the original sequences within each dataset.
While the MIM values slightly change, they are ordered in
nearly the same way as before (Additional File 1).
Importantly, the MIM values are distinctively higher in the H1
cell line compared to the other cell lines, suggesting that
such differences are unlikely due to a GC-content related
bias.
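A shuffled null model of this kind can be sketched in a few lines. The text does not specify the exact shuffling scheme, so the version below, which permutes the bases of each sequence independently and therefore preserves each sequence's length and base composition (including GC content), is an assumption; the function name is hypothetical.

```python
import random


def shuffle_sequences(seqs, seed=0):
    """Return a null set where each sequence is a random permutation of itself.

    Assumption: per-sequence mononucleotide shuffling. This keeps length and
    GC content fixed while destroying higher-order k-mer structure.
    """
    rng = random.Random(seed)
    shuffled = []
    for seq in seqs:
        bases = list(seq)
        rng.shuffle(bases)
        shuffled.append("".join(bases))
    return shuffled


null_set = shuffle_sequences(["ttgacattgaca", "ggccggccaatt"])
```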

Table 2 Distance values on the synthetic dataset

Cell   KL         p-value   Bhattacharyya   p-value   Hellinger   p-value

1      4.12E-03   <0.001    2.18E-03        <0.001    2.67E-02    <0.001
2      3.64E-03   <0.001    1.93E-03        <0.001    2.51E-02    <0.001
3      4.06E-03   <0.001    2.16E-03        <0.001    2.65E-02    <0.001
4      3.23E-03   <0.001    1.70E-03        <0.001    2.36E-02    <0.001
5      2.06E-03   <0.001    1.07E-03        <0.001    1.89E-02    <0.001
6      9.59E-04   <0.001    5.03E-04        <0.001    1.29E-02    <0.001
7      7.80E-04   0.0262    4.09E-04        0.0497    1.16E-02    0.0207
8      6.27E-04   0.2367    3.75E-04        0.1670    1.04E-02    0.2467

Since H3K4me1 marks cell-type specific enhancers,
one possible explanation for the high sequence
specificity in ES cells is that the targets might be
associated with a few ES-specific TFs. To test this possibility,
we searched for enrichment of TF motifs in the JASPAR
database using FIMO. Surprisingly, we were unable to
find any significantly enriched motif, suggesting that the
specificity is attributable to a different mechanism.

We then investigated whether the H3K4me1 targets in
ES cells are indeed highly predictable. In previous work,
we developed a sequence-based model, called the
N-score model, to predict epigenetic targets [19,21]. This
model integrates information from three classes of
sequence features (sequence periodicity, word counts,
and DNA structural parameters) by using a stepwise
logistic regression model (see Methods for details). Here we
applied the N-score model to predict H3K4me1 target
sequences. As negative control, we selected the same
number of sequences from the genome at random. We
evaluated the model performance by using 3-fold
cross-validation. We found that prediction accuracy is indeed
high for ES cells (AUC = 0.967) (Figure 4), whereas the
accuracy for other cell types is much lower.

Discussion 

Recently it has been shown that a large number of pro- 
teins may weakly bind to DNA [37]. It remains unclear 
to what extent such events are mediated by specific 
sequence information. This question cannot be 
answered by using traditional motif analysis, since the 
target sequences do not contain distinct motifs. As an 
alternative approach, we define a simple measure, called 
MIM, to quantify sequence specificity by aggregating 
information from all k-mers. Our approach does not
make any assumptions regarding motif presence, provid- 
ing a more versatile tool for sequence analysis. We vali- 
dated this method by analyzing both simulated and 
experimental data and found that it is indeed effective 
for detecting sequence specificity in both cases. 

We also showed that the MIM measure can provide 
new biological insights. Specifically, we found that the 
motif-absent targets of a TF may also contain specific 
sequence information due to interaction with other TFs. 
We also found that the sequence specificity for
H3K4me1 targets is higher in ES cells than in differentiated
cell-types, suggesting a unique role of DNA
sequence in the recruitment of H3K4me1 in ES cells.
Interestingly, this high specificity cannot be explained by
enrichment of known TF motifs, suggesting a yet
uncharacterized recruitment mechanism in ES cells. The
MIM algorithm is implemented in Python and can be
freely accessed at: https://github.com/lucapinello/mim.

Conclusion 

The role of DNA sequence in gene regulation remains
incompletely understood. Our MIM method has
extended previous work by further accounting for
sequence specificity due to accumulation of weak
sequence features. The information can be used as a
guide to systematically investigate the regulatory
mechanisms for a wide variety of biological processes.

Pinello et al. BMC Bioinformatics 2011, 12:408
http://www.biomedcentral.com/1471-2105/12/408

Figure 2 MIM values for STAT1 target sequences. (a) The MIM values and corresponding p-values (above the bars) for different subsets of
STAT1 target sequences: all targets, STAT1 motif containing ones, and STAT1 motif absent ones; (b) comparison of the MIM values with respect
to the null distribution, which is estimated by using 1000 sets of random sequences.

Table 3 Top 20 k-mers ranked by different distances on motif sequences in the STAT1 dataset

STAT1 Motif

KL      Bhattacharyya   Hellinger

aata    atat            aata
ttaa    tata            ttaa
aaat    aata            aaat
aaaa    ttaa            aaaa
ggaa    atta            ggaa
atat    aaat            atat
atac    taaa            atac
tcaa    atac            tcaa
aatt    aatt            aatt
acat    ataa            acat
taca    taca            taca
aggg    cata            aggg
cgga    aaaa            cgga
atta    attg            atta
attg    acat            attg
taga    tcaa            taga
caaa    agcg            caaa
acta    gata            acta
ccag    taga            ccag
agca    cgga            agca

Methods 

Synthetic data generation 

We simulated ChIPseq data for a TF whose motif
sequence is TTGACA. In order to simulate the variation
of motif sites among different target sequences, we
modeled the position weight matrix (PWM) as illustrated
in Table 7, where s measures the mutation rate
of the motif and can change between 0 (perfect motif) 
and 0.25 (totally random). We sampled s at 8 different 
values: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.1667, and 0.25. 
For each choice of s, we generated 2000 sequences of 
500 bp each. The sequences were initially generated by 
randomly sampling from the background distribution 
with the probabilities of A,C,G,T equal to 0.15, 0.35, 
0.35, 0.15, respectively. In addition, we randomly 
selected a subset of 1000 sequences and inserted the 
motif at a random location. 
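The generation procedure above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code: the function names are hypothetical, the motif site is sampled position by position from the PWM (consensus base with probability 1 - 3s, every other base with probability s), and the sampled site overwrites existing bases so that each sequence keeps its 500 bp length.

```python
import random

# Background probabilities for A, C, G, T as given in the text.
BACKGROUND = {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15}


def sample_motif_site(consensus, s, rng):
    """Draw one motif instance from the PWM parameterized by s."""
    site = []
    for c in consensus.lower():
        weights = [1 - 3 * s if base == c else s for base in "acgt"]
        site.append(rng.choices("acgt", weights=weights)[0])
    return "".join(site)


def simulate_set(s, n_seqs=2000, length=500, n_with_motif=1000,
                 consensus="TTGACA", seed=0):
    """Generate background sequences and plant a motif site in a random subset."""
    rng = random.Random(seed)
    bases = list(BACKGROUND)
    weights = list(BACKGROUND.values())
    seqs = ["".join(rng.choices(bases, weights=weights, k=length))
            for _ in range(n_seqs)]
    for i in rng.sample(range(n_seqs), n_with_motif):
        site = sample_motif_site(consensus, s, rng)
        pos = rng.randrange(length - len(site) + 1)
        # Assumption: overwrite in place rather than lengthen the sequence.
        seqs[i] = seqs[i][:pos] + site + seqs[i][pos + len(site):]
    return seqs


# One set per motif strength used in the text.
strengths = [0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.1667, 0.25]
```

Calling `simulate_set(s)` for each value in `strengths` would yield the eight sequence sets; the random control sets correspond to `n_with_motif=0`.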
Table 4 Distance values on the STAT1 dataset

Peaks             KL         p-value   Bhattacharyya   p-value   Hellinger   p-value

STAT1 Motif       3.26E-02   <0.001    2.19E-03        <0.001    7.51E-02    <0.001
Non STAT1 Motif   3.36E-02   <0.001    2.72E-03        <0.001    7.61E-02    <0.001
All               3.76E-02   <0.001    3.65E-03        <0.001    8.05E-02    <0.001



ChIPseq data source

Genome-wide STAT1 peak locations in the HeLa S3 cell
line were obtained from
http://archive.gersteinlab.org/proj/PeakSeq/Scoring_ChIPSeq/Results/STAT1 [31].
ChIPseq data for H3K4me1 in seven human cell lines
were obtained from the literature: CD4+ T cell [4], CD36+
and CD133+ T cells [35], H1, Huvec, K562, and Nhek
[1]. The raw data were processed by cisGenome to identify
peak locations [36]. The DNA sequences at the peak
locations were analyzed subsequently.

Motif analysis 

Motif analysis was done by using several tools in the
MEME suite (http://meme.nbcr.net/meme/) as follows.
Scanning DNA sequences for matches of a known motif
was done by using FIMO [32]. Motif comparison
was done by using TOMTOM.

Functional annotation 

Functional annotation was done by using the GOrilla 
software [38] (http://cbl-gorilla.cs.technion.ac.il/). 



Table 5 Top 20 k-mers ranked by different distances for the H1 cell line in the H3K4me1 dataset

H1 cell line

KL      Bhattacharyya   Hellinger

tcga    tcga            tcga
cgaa    tcca            cgaa
attc    attc            attc
tcca    atgg            tcca
atcg    cgaa            atcg
ggaa    ggaa            ggaa
atgg    aatg            atgg
aatg    atcg            aatg
aacg    tata            aacg
ctta    ttaa            ctta
gcta    ctta            gcta
ttaa    aacg            ttaa
ctaa    gcta            ctaa
agct    taaa            agct
ggta    ggta            ggta
taaa    ataa            cgga
cgga    cgga            taaa
acgg    atta            acgg
ataa    acgg            ataa
atta    aaaa            atta

Details of the MIM measure 

Each DNA sequence is mapped to numerical values by
enumerating the frequency of each k-mer, treating
complementary k-mers as the same. There are m = 136
non-redundant k-mers for k = 4. MIM is essentially a
metric between two distributions of k-mer frequencies.
Specifically, let $P = (P_{ij})$ be the k-mer frequency
distributions corresponding to a set of n target sequences
$S = (S_i)$, where $S_i$ represents a sequence in the set S. We
generate a set of n random sequences $R = (R_i)$ matching
the sequence lengths (analogously, $R_i$ represents a
sequence in the set R). Let $Q = (Q_{ij})$ be the k-mer
frequency distributions corresponding to R. Finally, let

$$P_j = \frac{1}{n}\sum_{i=1}^{n} P_{ij} \quad \text{and} \quad Q_j = \frac{1}{n}\sum_{i=1}^{n} Q_{ij}$$

($P_j$ in particular represents the
probability of the j-th k-mer in S; analogously, $Q_j$ represents
the probability of the j-th k-mer in R). Then the
difference between P and Q is quantified by the
symmetrical Kullback-Leibler (KL) divergence [28], as
follows:

$$d_{kl}(S,R) = \sum_{j=1}^{m} P_j \log\frac{P_j}{Q_j} + \sum_{j=1}^{m} Q_j \log\frac{Q_j}{P_j}$$

The MIM value corresponding to S is defined as the
expected value of $d_{kl}(S,R)$, which is estimated by
averaging over 1000 sets of random sequences. The MIM
value, using the symmetrical KL divergence, can be
interpreted as the expected number of
extra bits required to code samples from S when using
a code based on the background distribution. Note that
there exist several alternatives to measure the similarity
of two probability distributions [39]. To evaluate
whether the results are sensitive to the specific choice of
distances, we also computed MIM values based on two
other well-known distances between probability
distributions:

1) The Hellinger distance [39]

$$d_{hl}(S,R) = \sqrt{\sum_{j=1}^{m}\left(\sqrt{P_j} - \sqrt{Q_j}\right)^2}$$

whose main differences from $d_{kl}$ are 1) $d_{hl}$ naturally
satisfies the triangle inequality; and 2) the range of $d_{hl}$ is
the interval [0,1].
Figure 3 MIM values for H3K4me1 target sequences. (a) The MIM values and corresponding p-values (above the bars) for H3K4me1 target
sequences in different cell lines. Note that the MIM value for H1 is much higher than for other cell lines; (b) comparison of the MIM values with
respect to the null distribution, which is estimated from 1000 sets of random sequences.



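2) The Bhattacharyya distance [40]

$$d_{bh}(S,R) = \sum_{j=1}^{m}\left[\frac{1}{4}\,\frac{(\mu_{P_j} - \mu_{Q_j})^2}{\sigma_{P_j}^2 + \sigma_{Q_j}^2} + \frac{1}{2}\log\left(\frac{\sigma_{P_j}^2 + \sigma_{Q_j}^2}{2\,\sigma_{P_j}\sigma_{Q_j}}\right)\right]$$

which has been widely used for pattern recognition in
computer science [41], where

$$\mu_{P_j} = \frac{1}{n}\sum_{i=1}^{n} P_{ij} \quad \text{and} \quad \sigma_{P_j} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(P_{ij} - \mu_{P_j}\right)^2}$$

are the mean and standard deviation, respectively, of $P_j$
($\mu_{Q_j}$ and $\sigma_{Q_j}$ are defined similarly for $Q_j$).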
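The distance computations and the averaging step that defines the MIM value can be sketched as below. Several choices here are assumptions rather than details confirmed by the text: logarithms are taken in base 2 (in line with the interpretation in bits), a small pseudocount guards against zero frequencies, and the Hellinger form omits the 1/sqrt(2) normalization factor (including it would rescale the value to lie in [0, 1]). All function names are hypothetical.

```python
import math


def symmetric_kl(p, q, pseudo=1e-9):
    """Symmetric KL divergence between two k-mer frequency vectors, in bits."""
    total = 0.0
    for pj, qj in zip(p, q):
        pj, qj = pj + pseudo, qj + pseudo  # assumption: avoid log(0)
        total += pj * math.log2(pj / qj) + qj * math.log2(qj / pj)
    return total


def hellinger(p, q):
    """Hellinger-type distance (unnormalized; see the assumption above)."""
    return math.sqrt(sum((math.sqrt(pj) - math.sqrt(qj)) ** 2
                         for pj, qj in zip(p, q)))


def mim(target_profile, random_profiles, distance=symmetric_kl):
    """Average the distance between the target profile and each random set."""
    values = [distance(target_profile, q) for q in random_profiles]
    return sum(values) / len(values)


# Toy example with three k-mer classes instead of 136.
target = [0.6, 0.3, 0.1]
randoms = [[0.34, 0.33, 0.33], [0.32, 0.35, 0.33]]
print(mim(target, randoms) > 0)  # True
```

In practice each profile would be the 136-dimensional averaged frequency vector of one sequence set, and `random_profiles` would hold one vector per random set.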

In order to estimate the null distribution, we generated
1000 sets of random sequences and then calculated
MIM values for each random sequence set. The probability
density function (pdf) was estimated by using a
kernel method [42]. This pdf was used to infer not only
the mean and standard deviation of the null distribution
but also the statistical significance for any MIM value.
Recognizing the limited resolution of the estimated pdf,
we did not distinguish p-values that are smaller than
0.001.
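One way to turn a set of null MIM values into a p-value with a kernel density estimate is sketched below. The kernel type and bandwidth rule are not stated in the text, so the Gaussian kernel and the normal-reference rule-of-thumb bandwidth used here are assumptions, as is the function name. The upper-tail probability of a Gaussian kernel density estimate is the average of the per-kernel tail probabilities, and results below 0.001 are reported at that floor.

```python
import math
import statistics


def kde_p_value(observed, null_values, bandwidth=None, floor=0.001):
    """Upper-tail p-value of `observed` under a Gaussian KDE of the null values.

    Assumptions: Gaussian kernel; bandwidth = 1.06 * sd * n^(-1/5) by default.
    Values below `floor` are returned as `floor` (read as "< 0.001").
    """
    n = len(null_values)
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(null_values) * n ** (-1 / 5)
    # Each kernel is N(x, bandwidth^2); its tail beyond `observed` is
    # 0.5 * erfc((observed - x) / (bandwidth * sqrt(2))).
    tail = sum(
        0.5 * math.erfc((observed - x) / (bandwidth * math.sqrt(2)))
        for x in null_values
    ) / n
    return max(tail, floor)
```

For example, with 1000 null values, `kde_p_value(mim_value, null_mims)` would give the significance attached to one observed MIM value.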

N-score model 

The N-score model was described previously [19,21]. In
brief, the model integrates three types of sequence
features, including sequence periodicities [19], word counts
[16], and structural parameters [43], a total of 2920
candidate features. Model selection was done by
stepwise logistic regression. The final model was used for
target prediction.

Table 6 Distance values on the H3K4me1 dataset

Cell     KL         p-value   Bhattacharyya   p-value   Hellinger   p-value

H1       4.43E-01   <0.001    1.28E-02        <0.001    2.71E-01    <0.001
CD4+     1.97E-01   <0.001    8.47E-03        <0.001    1.82E-01    <0.001
NHEK     1.10E-01   <0.001    4.10E-03        <0.001    1.37E-01    <0.001
K562     0.083176   <0.001    0.003584        <0.001    0.119491    <0.001
CD133+   0.064867   <0.001    0.006796        <0.001    0.105424    <0.001
CD36+    0.026875   <0.001    0.002996        <0.001    0.067992    <0.001
HUVEC    0.014557   <0.001    0.002912        <0.001    0.050102    <0.001

Figure 4 N-score prediction of H3K4me1 target sequences.
Receiver operating characteristic (ROC) curves for different cell lines
using the N-score. Note that the AUC for H1 is much higher than for
other cell lines.

Table 7 PWM for synthetic motif generation

Position   1      2      3      4      5      6

A          s      s      s      1-3s   s      1-3s
C          s      s      s      s      1-3s   s
G          s      s      1-3s   s      s      s
T          1-3s   1-3s   s      s      s      s

Most informative k-mers selection 

Given $P_j$ and $Q_j$, associated with S and R respectively, it is
possible to calculate their Kullback-Leibler (KL)
divergence for each j, where j indicates the j-th k-mer
component. This results in a list of 136 distance values,
whose ranking can be used as a guide to identify the
most informative k-mers.
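The ranking step can be sketched as follows. Each k-mer is scored by its own term in the symmetric KL sum, which simplifies to (P_j - Q_j) log(P_j / Q_j); using the symmetric form, base-2 logarithms, and a small pseudocount are assumptions made for this illustration, and the function name is hypothetical.

```python
import math


def rank_kmers(kmers, p, q, top=20, pseudo=1e-9):
    """Rank k-mers by their individual contribution to the symmetric KL sum."""
    scored = []
    for word, pj, qj in zip(kmers, p, q):
        pj, qj = pj + pseudo, qj + pseudo  # assumption: avoid log(0)
        # P log(P/Q) + Q log(Q/P) collapses to (P - Q) log(P/Q).
        scored.append(((pj - qj) * math.log2(pj / qj), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]


# Toy example with three k-mers; real input would have 136 entries.
print(rank_kmers(["aaaa", "tcaa", "gaca"],
                 [0.3, 0.6, 0.1],
                 [0.3, 0.35, 0.35]))  # ['gaca', 'tcaa', 'aaaa']
```

Note that this score is high for k-mers that are either enriched or depleted relative to the random set, so the sign of P_j - Q_j should be inspected when enrichment specifically is of interest.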

Additional material 



Additional file 1: Choice of the null model for sequence specificity.

(a) The MIM values for H3K4me1 target sequences in different cell lines,
computed with a null model obtained by shuffling the original sequences.

(b) The MIM values for the same experiment using as a null model a set
of random sequences extracted from the genome with matching lengths.
Note that the H1 cell line is far more specific than the other cell
lines independently of the null model chosen.